The overall objective of this research proposal is semi-parametric inference
of the Cox regression model for a survival function
$\Pr(X > x \mid Z = z) = S(x \mid z) = [S_0(x)]^{e^{z\beta}}$,
where X is subject to interval censoring, Z represents the covariates, $S_0$ is a baseline
survival function, and $\beta$ represents the regression coefficients. One objective of our research is
to develop asymptotic inference for the generalized maximum likelihood estimator (GMLE)
of the regression coefficients $\beta$ and of $S(x \mid z)$. A critical limitation of the GMLE approach
under interval censoring is that it is computationally feasible only for a small data set. Thus
another aspect of our research is the investigation of a simple alternative to
the GMLE obtained by a two-step estimation procedure involving data grouping. In the
second year of our research, we have established consistency results for the GMLE and
the two-step estimators (TSE) of $\beta$ and $S(x \mid z)$. The results will be useful to breast cancer
researchers pursuing chemoprevention intervention trials involving surrogate endpoint
biomarkers, and to genetic epidemiologists conducting studies on familial aggregation of breast
cancer and related cancers.
B.  INTRODUCTION


Interval-censored  (IC)  data  are  encountered  in  three  areas  of  breast  cancer  research. 
The  most  common  application  is  in  clinical  relapse  follow-up  studies  in  which  the  study 
endpoint  is  disease-free  survival.  When  a  patient  relapses,  it  is  usually  known  that  the 
relapse  takes  place  between  two  follow-up  visits,  and  the  exact  time  to  relapse  is  unknown. 
In  statistics,  we  say  relapse  time  is  interval  censored.  Interval  censoring  is  also  encountered 
in  breast  cancer  registry  studies  in  which  information  on  family  history  of  cancer  is  updated 
periodically.  The  Strang  Breast  Surveillance  Program  for  women  at  increased  risk  for  breast 
cancer,  for  instance,  has  enlisted  over  800  women  with  complete  pedigree  information  which 
is  verified  and  updated  continuously.  Family  history  data  such  as  age  at  diagnosis  of  a 
specific  cancer,  or  a  benign  but  risk-conferring  condition,  are  obtained  from  each  registrant 
at  each  update.  Time  to  a  cancer  event,  and  definitely  time  to  first  detection  of  a  benign 
condition,  are  at  best  known  to  fall  in  the  time  interval  between  the  last  update  and  age 
at  diagnosis.  A  third  but  increasingly  important  area  of  application  of  interval  censoring 
is  in  breast  cancer  chemoprevention  experiments  or  prevention  trials,  which  involve  the 
observation  of  one  or  more  surrogate  endpoint  biomarkers  (SEB)  over  time.  The  scientific 
question  of  interest  here  is  the  estimation  of  time  for  the  SEB  to  reach  a  target  value, 
and  time  from  cessation  of  intake  of  a  chemopreventive  agent  to  the  loss  of  its  protective 
effect.  Unfortunately,  the  exact  values  of  both  these  time  variables  are  known  only  to  lie  in 
between  two  successive  assay  inspection  times.  In  a  breast  cancer  follow-up  study,  we  will 
often  encounter  covariates  (for  instance,  tumor  size  and  nodal  status  in  a  relapse  study,  and 
baseline  SEB  value  in  a  chemoprevention  trial). 

Let X denote a time-to-event variable with distribution function $F(x) = \Pr(X \le x)$, or
equivalently, survival function $S(x) = 1 - F(x)$. In interval censoring, X is not observed and
is known only to lie in an observable interval (L, R). In our previous DOD-funded grant,
we made fundamental contributions to both the theory of generalized maximum
likelihood (GML) estimation of S and the computation connected with inference based on the
GML estimator (GMLE) $\hat{S}$ of S. These contributions are restricted to the case of univariate
interval-censored data without covariates.

The Cox proportional hazards model [1] specifies that covariates have a proportional
effect on the hazard function of X. This model provides a powerful means of fitting failure
time observations to a distribution-free model and of estimating the risk of failure associated
with a vector of covariates. It is extensively used for right-censored data. Finkelstein
[2] applied the Cox model to the analysis of interval-censored data. However, she did not
establish asymptotic properties of the GMLE of the parameters in the model, and the approach
is limited to small sample sizes because of its computational difficulty.

Our  interest  in  IC  data  with  covariates  is  driven  by  needs  arising  from  two  related 
areas  of  breast  cancer  research  at  Strang.  First,  our  investigators  in  the  Strang  Cancer 
Genetics  Program  want  to  study  various  patterns  of  familial  aggregation  of  breast,  ovarian 
and  other  forms  of  cancer  using  family  history  data  from  the  Strang  Breast  Surveillance 
Program.  Studies  of  familial  early  onset  of  breast  cancer,  breast-ovarian  and  breast-prostate 
associations will lead to IC data with covariates; therefore, a proper statistical procedure,
together with feasible software to deal with such data, is very much needed. Second,
we conducted a one-year chemoprevention trial of indole-3-carbinol (I3C) for breast cancer
prevention. In this prevention trial we monitored the levels of two SEBs, a urinary estrogen
metabolite ratio and a blood counterpart, both of which are subject to interval censoring.
An earlier dose-ranging study of I3C conducted by Wong et al. [3] has been published.

The overall aim of this research proposal is to develop statistical inference for
interval-censored data with covariates that are encountered in breast cancer chemoprevention trials
employing surrogate endpoint biomarkers, and in breast cancer registry follow-up studies of
familial aggregation of breast and other forms of cancer. Asymptotic generalized maximum
likelihood theory under the Cox regression model will be investigated, and a computer software
package for maximum likelihood inference will be implemented.

C.  BODY 

C.1.  Model  Formulation  and  Likelihood  Equations.

Let $Y_{K,1} < Y_{K,2} < \cdots < Y_{K,K}$ denote the follow-up times for a patient who has made
K follow-up visits in a longitudinal follow-up study. Since the number of visits may vary
from patient to patient, K is a random positive integer. For convenience, define $Y_{K,0} = 0$ and
$Y_{K,K+1} = \infty$. The time-to-event variable of interest, X, is not directly observed; instead, it
is known to lie between two successive censoring time points $(Y_{K,j}, Y_{K,j+1})$, where
$j = 0, \ldots, K$. Note that X is left censored if j = 0, strictly interval censored if 0 < j < K, and
right censored if $X > Y_{K,K}$. The observable interval-censored data corresponding to X is
given by

$$(L, R) = (Y_{K,i}, Y_{K,i+1}) \quad \text{if } Y_{K,i} < X \le Y_{K,i+1}, \quad i = 0, 1, \ldots, K.$$  (1)

In addition to (L, R), we also observe a p x 1 covariate vector Z. We assume that K
and the $Y_{K,j}$'s are independent of (X, Z).
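
As an illustration of equation (1), the following minimal Python sketch maps an (unobservable) event time and a patient's visit times to the observed pair (L, R). The function name `censor_interval` and its arguments are hypothetical and are not part of the software described in this report; the sketch assumes the convention that L < X <= R.

```python
import math

def censor_interval(x, visit_times):
    """Return the observed pair (L, R) for event time x, assuming L < x <= R."""
    # Pad the visit times with Y_{K,0} = 0 and Y_{K,K+1} = infinity.
    grid = [0.0] + sorted(visit_times) + [math.inf]
    for left, right in zip(grid, grid[1:]):
        if left < x <= right:
            return left, right
    raise ValueError("x must be positive")
```

For example, with visits at times 1, 2 and 3, an event at time 2.5 is strictly interval censored in (2, 3], an event at 0.5 is left censored, and an event at 5 is right censored with R equal to infinity.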
The Cox regression model for the survival function at X = x given Z = z is represented
by

$$S(x \mid z) = [S_0(x)]^{e^{z\beta}},$$  (2)

where $z\beta$ is the dot product of z and $\beta$, $S_0(x)$ is a baseline survival function and $\beta$ is a
p-dimensional regression coefficient vector.
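
The link between model (2) and the proportional hazards property can be made explicit with a short derivation. It is a sketch that assumes $S_0$ is differentiable, so that a hazard function exists; the symbols $\Lambda$ and $\lambda$ for the cumulative hazard and the hazard are introduced here for illustration only.

```latex
\begin{align*}
\Lambda(x \mid z) &= -\log S(x \mid z)
                   = -e^{z\beta} \log S_0(x)
                   = e^{z\beta}\, \Lambda_0(x), \\
\lambda(x \mid z) &= \frac{d}{dx}\, \Lambda(x \mid z)
                   = e^{z\beta}\, \lambda_0(x).
\end{align*}
% Numerical illustration: if S_0(x) = 0.8 and z\beta = \log 2,
% then S(x | z) = 0.8^{2} = 0.64.
```

The covariates therefore act multiplicatively on the baseline hazard, which is the sense in which their effect is proportional.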

Let $I_i = (L_i, R_i, Z_i)$, $i = 1, \ldots, n$, be a random sample of n interval-censored
observations with covariates. In terms of the original observed intervals, the likelihood
function of S and b is given by

$$L = \prod_{i=1}^{n} \left( (S(L_i))^{e^{z_i b}} - (S(R_i))^{e^{z_i b}} \right),$$  (3)

where S is a survival function and b is a p x 1 vector. The GMLE of $(S_0, \beta)$ is
a value (S, b) that maximizes (3) over all survival functions S and all $b \in \mathcal{R}^p$.

Since the GMLE of $S_0$ places all its probability mass on the innermost intervals of the $I_i$'s (see Peto
[4] or Turnbull [5]), it is often computationally simpler to express L in terms of innermost
intervals.

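We say that an interval A is an innermost interval of the $I_i$'s if A is a nonempty finite
intersection of one or more of the $I_i$'s such that either $I_i \cap A = \emptyset$ or $I_i \cap A = A$ for each
i. Suppose there are a total of m distinct innermost intervals $A_j = (\xi_j, \eta_j]$, where $\eta_j \le \xi_{j+1}$
and $m \le n$. Then the likelihood function (3) is equivalently given by

$$L = \prod_{i=1}^{n} \left[ \left( \sum_{k > l_i} s_k \right)^{e^{z_i b}} - \left( \sum_{k > r_i} s_k \right)^{e^{z_i b}} \right],$$  (4)

where $l_i = \sup\{j : \eta_j \le L_i\}$, $r_i = \sup\{j : \eta_j \le R_i\}$, and $s = (s_1, \ldots, s_m)$ denotes the vector
of probability weights. The log likelihood of (s, b) is

$$\mathcal{L}(s, b) = \sum_{i=1}^{n} \log \left[ \left( \sum_{k > l_i} s_k \right)^{e^{z_i b}} - \left( \sum_{k > r_i} s_k \right)^{e^{z_i b}} \right].$$  (5)

Note that $\left( \sum_{k > l_i} s_k \right)^{e^{z_i b}} = 1$ if $l_i = 0$ and $\left( \sum_{k > r_i} s_k \right)^{e^{z_i b}} = 0$ if $r_i = m$.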
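
The construction of the innermost intervals and the evaluation of the log likelihood (5) can be sketched in Python as follows. This is an illustrative sketch rather than the software described in this report: the function names are hypothetical, each observed interval is treated as the half-open interval (L_i, R_i], and the indices l_i and r_i are taken as the number of innermost right endpoints not exceeding L_i and R_i, respectively.

```python
import math
from bisect import bisect_right

def innermost_intervals(intervals):
    """Innermost intervals of half-open intervals (L_i, R_i], sorted by position."""
    lefts = {left for left, _ in intervals}
    rights = {right for _, right in intervals}
    points = sorted(lefts | rights)
    # Keep adjacent endpoint pairs that run from some L_i to some R_j.
    return [(a, b) for a, b in zip(points, points[1:])
            if a in lefts and b in rights]

def log_likelihood(s, b, intervals, covariates):
    """Evaluate (5) for weights s (one per innermost interval) and coefficients b."""
    etas = [eta for _, eta in innermost_intervals(intervals)]
    # tail[j] = s_{j+1} + ... + s_m, so tail[l_i] is the sum over k > l_i.
    tail = [0.0] * (len(s) + 1)
    for j in range(len(s) - 1, -1, -1):
        tail[j] = tail[j + 1] + s[j]
    total = 0.0
    for (left, right), z in zip(intervals, covariates):
        power = math.exp(sum(zj * bj for zj, bj in zip(z, b)))
        l_i = bisect_right(etas, left)    # number of eta_j <= L_i
        r_i = bisect_right(etas, right)   # number of eta_j <= R_i
        # math.log raises ValueError if s gives an observation zero likelihood.
        total += math.log(tail[l_i] ** power - tail[r_i] ** power)
    return total
```

For the three observed intervals (0, 2], (1, 3] and (2, infinity], the innermost intervals are (1, 2] and (2, 3], so only m = 2 weights need to be estimated in addition to the regression coefficients.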

C.2.  Generalized  maximum  likelihood  estimation. 

A GMLE of $(s, \beta)$ is a value of (s, b) that maximizes the log likelihood (5). We
could follow the Newton-Raphson (NR) algorithm used by Finkelstein [2]. However, this
would involve inverting a matrix of order (m + p - 1) x (m + p - 1). Since m can be
large when n is large, the NR algorithm is not feasible for a large data set.

We advocate a computationally simple approach: first grouping the original data
$(L_i, R_i)$ and then applying a two-step iterative scheme to obtain the two-step estimators
(TSE) of $S_0$ and $\beta$ based on the innermost intervals corresponding to the grouped intervals.
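
The grouping rule is not spelled out in this section. Purely as an assumption for illustration, the sketch below widens each observed interval to the nearest enclosing points of a fixed partition of the time axis; the function name `group_interval` and the argument `cutpoints` are hypothetical. Under such a rule, all grouped endpoints lie on the partition, so the number of innermost intervals m is bounded by the number of partition cells rather than by n.

```python
import math
from bisect import bisect_left, bisect_right

def group_interval(left, right, cutpoints):
    """Widen (left, right] to the enclosing partition points (assumed rule)."""
    # The partition always contains 0 and infinity; left is assumed >= 0.
    grid = [0.0] + sorted(cutpoints) + [math.inf]
    lo = grid[bisect_right(grid, left) - 1]   # largest grid point <= left
    hi = grid[bisect_left(grid, right)]       # smallest grid point >= right
    return lo, hi
```

With partition points 1, 2 and 3, the interval (1.2, 2.7] is grouped to (1, 3], while an interval whose endpoints already lie on the partition, such as (2, 3], is left unchanged.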

In the first year of our research, we successfully implemented computer software
to calculate the TSEs of $S_0$ and $\beta$. A manuscript focusing on the two-step computation
scheme is near completion and will be submitted for publication by the end of summer.

In our second year of research, we have applied our two-step estimation procedure to
the Cox regression analysis of a long-term prognostic follow-up study involving 375 women
with unilateral T1-2N0, T1-2N1 and T3-4 breast cancer. All the patients were treated at
Memorial Sloan Kettering Cancer Center, and the follow-up is being conducted at Strang
Cancer Prevention Center. The main objective of the study is to assess the prognostic
significance of bone marrow micrometastasis (BMM) in predicting relapse. Standard clinical
variables including nodal status and tumor diameter were included in the Cox model.
Although we have not yet established asymptotic normality to validate the P values that
were reported for the study, our two-step Cox regression analysis gave a strong indication that
BMM was not as predictive of relapse as previously expected (Osborne and Wong [6]). We
shall return to the BMM analysis when we fully establish the asymptotic normality of the
GMLE of the Cox regression parameters. In our second year of research, therefore, we have
moved ahead of our statement of work by making a start on Task 8. Since the BMM relapse
follow-up study provides a complete and final data set that optimally satisfies our need for
an empirical example to illustrate our asymptotic GML procedure for Cox regression, we
intend to focus on this data set instead of the example mentioned in Task 8.

Also, in the second year of our research, we have established consistency of the GMLE
of $\beta$ and $S_0$ (and hence of $S(\cdot \mid z)$) under either of the following assumptions:

AS1: $S_0$ is arbitrary and each of the censoring variables $Y_1, \ldots, Y_K$ takes on finitely many
values.

AS2: $S_0$ is arbitrary, each of the censoring variables $Y_1, \ldots, Y_K$ is continuous, and some
regularity conditions are imposed on either $S_0$ or the joint distribution function G of
$K, Y_1, \ldots, Y_K$.

Specifically, under either AS1 or AS2,

$$\Pr\left\{ \lim_{n \to \infty} \hat{\beta} = \beta \right\} = 1,$$  (6)

and

$$\Pr\left\{ \lim_{n \to \infty} \sup_{t \in H} | \hat{S}_0(t) - S_0(t) | = 0 \right\} = 1,$$  (7)

where H denotes the support set of $Y_1, \ldots, Y_K$. Note that $\hat{S}_0(t)$ is guaranteed to be consistent
for $t \in H$, and not elsewhere. However, the set H is not necessarily a time interval (for
instance, H may be a collection of discrete points). To make the consistency results
more useful, we have established that if $S_0$ is continuous and the support of $Y_1, \ldots, Y_K$
is dense in [0, T] for some T > 0, then $\hat{S}_0(t)$ is consistent for all $t \in [0, T]$. The practical
implication of the denseness requirement is that pointwise consistency of $\hat{S}_0(t)$ on [0, T] is guaranteed
only if all the subjects in a follow-up study are followed at very frequent, closely spaced intervals.

We have also established similar consistency results for the TSE, with the added
assumption that the maximal length of the partition interval tends to 0 as n tends to $\infty$. The
consistency results are being summarized in a manuscript under preparation.

D.  KEY  RESEARCH  ACCOMPLISHMENTS 

•  We have completed Task 3 and Task 4, pertaining to consistency of the GMLE of $S_0(t)$ and
$\beta$, and also consistency of the TSE for the same parameters.

•  We have made a start on Task 8 by performing a preliminary Cox regression analysis
on a long-term prognostic relapse follow-up study involving 375 breast cancer patients.

E.  REPORTABLE  OUTCOMES 

•  An abstract presented at the 2002 ASCO Meeting and published in the proceedings [3].

•  A computer program to calculate the GMLE of the baseline survival function $S_0$ and
the Cox regression coefficients $\beta$.

•  A computer program to calculate the TSE of $S_0$ and $\beta$.

Both of these computer programs have been made available to the public via the
Internet site http://www.math.binghamton.edu/qyu/index.html.

F.  CONCLUSIONS 

In the second year of our DOD grant, we have successfully completed the research
objectives stated in Tasks 3 and 4. We have demonstrated that the GMLE and the TSE of
the regression coefficients are consistent.

The results that we have established will be useful to breast cancer researchers
pursuing chemoprevention intervention trials involving surrogate endpoint biomarkers, and to
genetic epidemiologists conducting studies on familial aggregation of breast cancer and
related cancers.